
CLAIMS 



1. A method comprising: 

developing a language model from a tuning set of information;
segmenting at least a subset of a received textual corpus and calculating a 
perplexity value for each segment; 

refining the language model with one or more segments of the received 
corpus based, at least in part, on the calculated perplexity value for the one or more 
segments. 


2. A method according to claim 1, wherein the tuning set of information
is application specific.

3. A method according to claim 1, wherein the tuning set of information
is comprised of one or more application-specific documents.

4. A method according to claim 1, wherein the tuning set of information
is a highly accurate set of textual information linguistically relevant to, but not
taken from, the received textual corpus.



5. A method according to claim 1, further comprising a training set

6. A method according to claim 5, further comprising: 

ranking the segments of the training set based, at least in part, on the
calculated perplexity value for each segment. 

7. A method according to claim 1, wherein segmenting at least the
subset of the received corpus comprises:

clustering every N items of the received corpus into a training unit, wherein
resultant training units are separated by gaps;

calculating the similarity within a sequence of training chunks on either side
of each of the gaps; and

selecting segment boundaries that maximize intra-segment similarity and
inter-segment disparity.
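
The three steps recited in claim 7 can be illustrated with a minimal sketch. It is not the claimed implementation: the cosine similarity measure, the `window` parameter, the particular depth-score formula, and all function names are assumptions introduced here for illustration only.

```python
from collections import Counter
from math import sqrt

def make_units(items, n):
    """Cluster every n items of the corpus into one training unit."""
    return [items[i:i + n] for i in range(0, len(items), n)]

def cosine(a, b):
    """Cosine similarity between two bags of items (assumed similarity measure)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cohesion_scores(units, window=2):
    """Similarity between the `window` units on either side of each gap."""
    scores = []
    for gap in range(1, len(units)):
        left = Counter(x for u in units[max(0, gap - window):gap] for x in u)
        right = Counter(x for u in units[gap:gap + window] for x in u)
        scores.append(cosine(left, right))
    return scores

def depth_scores(cohesion):
    """Assumed depth score: how far each gap's cohesion falls below the
    highest cohesion found to its left and to its right."""
    depths = []
    for i, c in enumerate(cohesion):
        left_peak = max(cohesion[:i + 1])
        right_peak = max(cohesion[i:])
        depths.append((left_peak - c) + (right_peak - c))
    return depths

def segment(items, n=3, window=2, num_boundaries=1):
    """Split items at the gaps with the largest depth scores."""
    units = make_units(items, n)
    depths = depth_scores(cohesion_scores(units, window))
    # Index i in `depths` is the gap between units[i] and units[i + 1].
    order = sorted(range(len(depths)), key=lambda i: depths[i], reverse=True)
    boundaries = sorted(order[:num_boundaries])
    segments, start = [], 0
    for gap in boundaries:
        segments.append([x for u in units[start:gap + 1] for x in u])
        start = gap + 1
    segments.append([x for u in units[start:] for x in u])
    return segments
```

In this sketch a low cohesion score at a gap, flanked by higher scores, yields a high depth score, so the boundary is placed where the text on the two sides differs most.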



8. A method according to claim 7, wherein the resultant segment defines
a training chunk.

9. A method according to claim 7, wherein N is an empirically derived
value based, at least in part, on the size of the received corpus.

10. A method according to claim 7, wherein the calculation of the
similarity within a sequence of training units defines a cohesion score.

11. A method according to claim 10, wherein intra-segment similarity is
measured by the cohesion score.

12. A method according to claim 7, wherein inter-segment disparity is 
approximated from the cohesion score. 

13. A method according to claim 7, wherein the calculation of inter-segment
disparity defines a depth score.



14. A method according to claim 1, wherein the perplexity value is a 
measure of the predictive power of a certain language model to a segment of the
received corpus. 
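
As an illustration of the perplexity value referred to in claim 14, the following sketch computes perplexity in its standard form: 2 raised to the negative average log2 probability the model assigns to each token of a segment. The unigram model, the add-one smoothing, and the function names are assumptions chosen to keep the example short; the claims do not specify them.

```python
import math
from collections import Counter

def train_unigram(tuning_tokens):
    """Assumed stand-in for a language model: unigram counts from the tuning set."""
    return Counter(tuning_tokens)

def perplexity(counts, segment):
    """Perplexity of `segment` under an add-one-smoothed unigram model.

    Lower values mean the model predicts the segment better.
    """
    if not segment:
        raise ValueError("segment must contain at least one token")
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen tokens
    log_prob = 0.0
    for token in segment:
        p = (counts[token] + 1) / (total + vocab)
        log_prob += math.log2(p)
    return 2 ** (-log_prob / len(segment))
```

A segment that resembles the tuning set receives a lower perplexity than one drawn from unrelated text, which is the property the refinement step relies on.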



15. A method according to claim 1, further comprising: 
ranking the segments of at least the subset of the received corpus based, at
least in part, on the calculated perplexity value of each segment; and

updating the tuning set of information with one or more of the segments 
from at least the subset of the received corpus. 



16. A method according to claim 15, wherein one or more of the
segments with the lowest perplexity value from at least the subset of the received 
corpus are added to the tuning set. 
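
The ranking and update steps of claims 15 and 16 might be sketched as follows. The smoothed unigram perplexity, the `keep` parameter, and the function names are hypothetical choices made for this example and are not taken from the claims.

```python
import math
from collections import Counter

def perplexity(counts, segment):
    """Add-one-smoothed unigram perplexity (assumed model for illustration)."""
    total, vocab = sum(counts.values()), len(counts) + 1
    log_prob = sum(math.log2((counts[t] + 1) / (total + vocab)) for t in segment)
    return 2 ** (-log_prob / len(segment))

def refine_tuning_set(tuning_tokens, segments, keep=1):
    """Rank segments by perplexity and add the `keep` lowest to the tuning set."""
    counts = Counter(tuning_tokens)
    ranked = sorted(segments, key=lambda seg: perplexity(counts, seg))
    updated = list(tuning_tokens)
    for seg in ranked[:keep]:
        updated.extend(seg)
    return updated, ranked
```

A language model rebuilt from the returned tuning set would then reflect the added low-perplexity segments.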

17. A method according to claim 1, further comprising: 
utilizing the refined language model in an application to predict a likelihood
of another corpus.

18. A storage medium comprising a plurality of executable instructions 
including at least a subset of which, when executed, implement a method 
according to claim 1.



19. A system comprising:

a storage medium having stored therein a plurality of executable
instructions; and

an execution unit, coupled to the storage medium, to execute at least a
subset of the plurality of executable instructions to implement a method according
to claim 1.



20. A storage medium comprising a plurality of executable instructions
which, when executed, implement a language modeling agent to develop a
language model from a tuning set of information, to segment at least a subset of a
received textual corpus and calculate a perplexity value for each segment, and to
refine the language model with one or more segments of the received corpus
based, at least in part, on the calculated perplexity value for the one or more
segments.



21. A storage medium according to claim 20, wherein the language 
modeling agent utilizes a tuning set of information relevant to that of the received
corpus. 



22. A storage medium according to claim 20, wherein the language
modeling agent ranks the segments of the training set based, at least in part, on a
measure of similarity between two or more segments.

23. A storage medium according to claim 22, wherein the similarity 
measure is calculated for adjacent segments. 

24. A storage medium according to claim 20, wherein the language
modeling agent segments the received corpus by clustering every N items of the
received corpus into a training unit, wherein the training units are separated by
gaps, calculating the similarity within a sequence of training units on either side of
each of the gaps, and selecting segment boundaries that improve intra-segment
similarity and inter-segment disparity.

25. A storage medium according to claim 20, further comprising 
instructions to implement an application which selectively invokes the language 
modeling agent to predict a likelihood of another corpus. 



26. A storage medium according to claim 25, wherein the application is 
one or more of a spelling and/or grammar checker, a word-processor, a speech 
recognition application, a language translation application, and the like.



27. A system comprising: 

a storage medium drive, to removably receive a storage medium according
to claim 20; and 

an execution unit, coupled to the storage medium drive, to execute at least a
subset of the plurality of instructions and implement the language modeling agent. 



28. A modeling agent comprising: 

a controller, to receive invocation requests to develop a language model
from a corpus; and

a data structure generator, responsive to the controller, to develop a
language model from a tuning set of information, segment at least a subset of the
received corpus, calculate a perplexity value for each segment, and refine the
language model with one or more segments of the received corpus based, at least in
part, on the calculated perplexity value.



29. A modeling agent according to claim 28, wherein the tuning set is 
dynamically selected as relevant to the received corpus. 



30. A modeling agent according to claim 28, the data structure generator
comprising:

a dynamic lexicon generation function, to develop an initial lexicon from
the tuning set, and to update the lexicon with select segments from the received
corpus.



31. A modeling agent according to claim 28, the data structure generator
comprising:

a frequency analysis function, to determine a frequency of occurrence of
segments within the received corpus.



32. A modeling agent according to claim 28, the data structure generator
comprising:

a dynamic segmentation function, to iteratively segment the received corpus
to improve a predictive performance attribute of the modeling agent.

33. A modeling agent according to claim 32, wherein the dynamic
segmentation function iteratively re-segments the received corpus until the
language model reaches an acceptable threshold.



34. A modeling agent according to claim 32, the data structure generator 
further comprising: 

a frequency analysis function, to determine a frequency of occurrence of
segments within the received corpus. 



35. A modeling agent according to claim 34, wherein the data structure
generator selectively removes segments from the data structure that do not meet a 
minimum frequency threshold, and dynamically re-segments the received corpus to 
improve predictive capability while reducing the size of the data structure. 
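
The frequency analysis and pruning recited in claims 34 and 35 could look roughly like the sketch below. It covers only the counting and removal steps, not the re-segmentation; the dictionary used as the data structure, the `min_count` parameter, and the function name are assumptions made for illustration.

```python
from collections import Counter

def prune_by_frequency(segments, min_count=2):
    """Count how often each segment occurs and drop those seen fewer
    than `min_count` times (hypothetical minimum frequency threshold)."""
    freq = Counter(tuple(seg) for seg in segments)
    return {seg: n for seg, n in freq.items() if n >= min_count}
```

Removing rare segments shrinks the stored structure; a fuller implementation would then re-segment the corpus and check that predictive capability has not degraded.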